{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Normalization\n",
    "\n",
    "Normalization is a crucial process for ensuring optimal performance and efficient management of data in file-based database systems like ParquetDB. In traditional databases, normalization typically refers to structuring relational tables to reduce redundancy. In ParquetDB’s context, **normalization** helps balance the distribution of data across multiple Parquet files, avoiding situations where files have uneven row counts. \n",
    "\n",
    "Without proper normalization, data skew can lead to performance bottlenecks in operations such as queries, inserts, updates, or deletions. By normalizing your dataset, ParquetDB rewrites and restructures your files to have a more consistent number of rows, improving parallelization and read/write speeds. \n",
    "\n",
    "In this notebook, we will:\n",
    "1. Generate an example dataset using `generate_similar_data`.\n",
    "2. Demonstrate how to normalize data in ParquetDB using `NormalizeConfig`.\n",
    "3. Show how normalization can improve performance by ensuring each file has a balanced distribution of rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pprint\n",
    "import os\n",
    "import shutil\n",
    "from parquetdb.utils.general_utils import generate_similar_data\n",
    "from parquetdb import ParquetDB, NormalizeConfig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate Example Data\n",
    "\n",
    "Below, we’ll generate a dataset that imitates real-world data variations using the `generate_similar_data` utility function. This function creates new data entries based on the structure of a provided template.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated Data:\n",
      "{'column_0': 'test_74',\n",
      " 'column_1': 'test_16',\n",
      " 'column_10': 'test_64',\n",
      " 'column_100': 'test_84',\n",
      " 'column_101': 'test_76',\n",
      " 'column_102': 'test_78',\n",
      " 'column_103': 'test_26',\n",
      " 'column_104': 'test_16',\n",
      " 'column_105': 'test_70',\n",
      " 'column_106': 'test_19',\n",
      " 'column_107': 'test_5',\n",
      " 'column_108': 'test_82',\n",
      " 'column_109': 'test_54',\n",
      " 'column_11': 'test_72',\n",
      " 'column_110': 'test_94',\n",
      " 'column_111': 'test_59',\n",
      " 'column_112': 'test_31',\n",
      " 'column_113': 'test_33',\n",
      " 'column_114': 'test_30',\n",
      " 'column_115': 'test_82',\n",
      " 'column_116': 'test_2',\n",
      " 'column_117': 'test_20',\n",
      " 'column_118': 'test_82',\n",
      " 'column_119': 'test_31',\n",
      " 'column_12': 'test_26',\n",
      " 'column_120': 'test_89',\n",
      " 'column_121': 'test_84',\n",
      " 'column_122': 'test_85',\n",
      " 'column_123': 'test_81',\n",
      " 'column_124': 'test_37',\n",
      " 'column_125': 'test_100',\n",
      " 'column_126': 'test_95',\n",
      " 'column_127': 'test_54',\n",
      " 'column_128': 'test_86',\n",
      " 'column_129': 'test_82',\n",
      " 'column_13': 'test_5',\n",
      " 'column_130': 'test_62',\n",
      " 'column_131': 'test_94',\n",
      " 'column_132': 'test_87',\n",
      " 'column_133': 'test_99',\n",
      " 'column_134': 'test_73',\n",
      " 'column_135': 'test_56',\n",
      " 'column_136': 'test_98',\n",
      " 'column_137': 'test_36',\n",
      " 'column_138': 'test_4',\n",
      " 'column_139': 'test_88',\n",
      " 'column_14': 'test_1',\n",
      " 'column_140': 'test_71',\n",
      " 'column_141': 'test_35',\n",
      " 'column_142': 'test_71',\n",
      " 'column_143': 'test_76',\n",
      " 'column_144': 'test_55',\n",
      " 'column_145': 'test_55',\n",
      " 'column_146': 'test_15',\n",
      " 'column_147': 'test_99',\n",
      " 'column_148': 'test_80',\n",
      " 'column_149': 'test_61',\n",
      " 'column_15': 'test_91',\n",
      " 'column_150': 'test_33',\n",
      " 'column_151': 'test_58',\n",
      " 'column_152': 'test_75',\n",
      " 'column_153': 'test_51',\n",
      " 'column_154': 'test_46',\n",
      " 'column_155': 'test_20',\n",
      " 'column_156': 'test_44',\n",
      " 'column_157': 'test_49',\n",
      " 'column_158': 'test_51',\n",
      " 'column_159': 'test_2',\n",
      " 'column_16': 'test_4',\n",
      " 'column_160': 'test_25',\n",
      " 'column_161': 'test_29',\n",
      " 'column_162': 'test_54',\n",
      " 'column_163': 'test_3',\n",
      " 'column_164': 'test_23',\n",
      " 'column_165': 'test_84',\n",
      " 'column_166': 'test_41',\n",
      " 'column_167': 'test_45',\n",
      " 'column_168': 'test_65',\n",
      " 'column_169': 'test_34',\n",
      " 'column_17': 'test_99',\n",
      " 'column_170': 'test_60',\n",
      " 'column_171': 'test_57',\n",
      " 'column_172': 'test_99',\n",
      " 'column_173': 'test_67',\n",
      " 'column_174': 'test_25',\n",
      " 'column_175': 'test_97',\n",
      " 'column_176': 'test_62',\n",
      " 'column_177': 'test_30',\n",
      " 'column_178': 'test_21',\n",
      " 'column_179': 'test_70',\n",
      " 'column_18': 'test_41',\n",
      " 'column_180': 'test_59',\n",
      " 'column_181': 'test_15',\n",
      " 'column_182': 'test_67',\n",
      " 'column_183': 'test_20',\n",
      " 'column_184': 'test_41',\n",
      " 'column_185': 'test_41',\n",
      " 'column_186': 'test_42',\n",
      " 'column_187': 'test_60',\n",
      " 'column_188': 'test_51',\n",
      " 'column_189': 'test_93',\n",
      " 'column_19': 'test_62',\n",
      " 'column_190': 'test_16',\n",
      " 'column_191': 'test_60',\n",
      " 'column_192': 'test_32',\n",
      " 'column_193': 'test_94',\n",
      " 'column_194': 'test_56',\n",
      " 'column_195': 'test_91',\n",
      " 'column_196': 'test_29',\n",
      " 'column_197': 'test_65',\n",
      " 'column_198': 'test_3',\n",
      " 'column_199': 'test_5',\n",
      " 'column_2': 'test_36',\n",
      " 'column_20': 'test_12',\n",
      " 'column_200': 'test_43',\n",
      " 'column_201': 'test_25',\n",
      " 'column_202': 'test_25',\n",
      " 'column_203': 'test_31',\n",
      " 'column_204': 'test_70',\n",
      " 'column_205': 'test_5',\n",
      " 'column_206': 'test_24',\n",
      " 'column_207': 'test_8',\n",
      " 'column_208': 'test_6',\n",
      " 'column_209': 'test_81',\n",
      " 'column_21': 'test_66',\n",
      " 'column_210': 'test_88',\n",
      " 'column_211': 'test_7',\n",
      " 'column_212': 'test_98',\n",
      " 'column_213': 'test_83',\n",
      " 'column_214': 'test_5',\n",
      " 'column_215': 'test_83',\n",
      " 'column_216': 'test_62',\n",
      " 'column_217': 'test_70',\n",
      " 'column_218': 'test_6',\n",
      " 'column_219': 'test_42',\n",
      " 'column_22': 'test_44',\n",
      " 'column_220': 'test_68',\n",
      " 'column_221': 'test_57',\n",
      " 'column_222': 'test_93',\n",
      " 'column_223': 'test_5',\n",
      " 'column_224': 'test_20',\n",
      " 'column_225': 'test_8',\n",
      " 'column_226': 'test_85',\n",
      " 'column_227': 'test_70',\n",
      " 'column_228': 'test_72',\n",
      " 'column_229': 'test_9',\n",
      " 'column_23': 'test_98',\n",
      " 'column_230': 'test_60',\n",
      " 'column_231': 'test_76',\n",
      " 'column_232': 'test_88',\n",
      " 'column_233': 'test_20',\n",
      " 'column_234': 'test_42',\n",
      " 'column_235': 'test_33',\n",
      " 'column_236': 'test_63',\n",
      " 'column_237': 'test_78',\n",
      " 'column_238': 'test_21',\n",
      " 'column_239': 'test_11',\n",
      " 'column_24': 'test_87',\n",
      " 'column_240': 'test_79',\n",
      " 'column_241': 'test_25',\n",
      " 'column_242': 'test_82',\n",
      " 'column_243': 'test_70',\n",
      " 'column_244': 'test_77',\n",
      " 'column_245': 'test_4',\n",
      " 'column_246': 'test_30',\n",
      " 'column_247': 'test_13',\n",
      " 'column_248': 'test_29',\n",
      " 'column_249': 'test_24',\n",
      " 'column_25': 'test_42',\n",
      " 'column_250': 'test_37',\n",
      " 'column_251': 'test_77',\n",
      " 'column_252': 'test_53',\n",
      " 'column_253': 'test_52',\n",
      " 'column_254': 'test_26',\n",
      " 'column_255': 'test_43',\n",
      " 'column_256': 'test_3',\n",
      " 'column_257': 'test_4',\n",
      " 'column_258': 'test_23',\n",
      " 'column_259': 'test_55',\n",
      " 'column_26': 'test_39',\n",
      " 'column_260': 'test_66',\n",
      " 'column_261': 'test_47',\n",
      " 'column_262': 'test_24',\n",
      " 'column_263': 'test_16',\n",
      " 'column_264': 'test_78',\n",
      " 'column_265': 'test_49',\n",
      " 'column_266': 'test_36',\n",
      " 'column_267': 'test_74',\n",
      " 'column_268': 'test_31',\n",
      " 'column_269': 'test_62',\n",
      " 'column_27': 'test_66',\n",
      " 'column_270': 'test_27',\n",
      " 'column_271': 'test_2',\n",
      " 'column_272': 'test_71',\n",
      " 'column_273': 'test_64',\n",
      " 'column_274': 'test_33',\n",
      " 'column_275': 'test_83',\n",
      " 'column_276': 'test_73',\n",
      " 'column_277': 'test_44',\n",
      " 'column_278': 'test_25',\n",
      " 'column_279': 'test_79',\n",
      " 'column_28': 'test_77',\n",
      " 'column_280': 'test_84',\n",
      " 'column_281': 'test_16',\n",
      " 'column_282': 'test_42',\n",
      " 'column_283': 'test_69',\n",
      " 'column_284': 'test_76',\n",
      " 'column_285': 'test_94',\n",
      " 'column_286': 'test_75',\n",
      " 'column_287': 'test_81',\n",
      " 'column_288': 'test_55',\n",
      " 'column_289': 'test_63',\n",
      " 'column_29': 'test_45',\n",
      " 'column_290': 'test_12',\n",
      " 'column_291': 'test_71',\n",
      " 'column_292': 'test_9',\n",
      " 'column_293': 'test_5',\n",
      " 'column_294': 'test_9',\n",
      " 'column_295': 'test_44',\n",
      " 'column_296': 'test_33',\n",
      " 'column_297': 'test_70',\n",
      " 'column_298': 'test_97',\n",
      " 'column_299': 'test_84',\n",
      " 'column_3': 'test_28',\n",
      " 'column_30': 'test_91',\n",
      " 'column_300': 'test_64',\n",
      " 'column_301': 'test_97',\n",
      " 'column_302': 'test_30',\n",
      " 'column_303': 'test_28',\n",
      " 'column_304': 'test_70',\n",
      " 'column_305': 'test_60',\n",
      " 'column_306': 'test_44',\n",
      " 'column_307': 'test_7',\n",
      " 'column_308': 'test_14',\n",
      " 'column_309': 'test_50',\n",
      " 'column_31': 'test_10',\n",
      " 'column_310': 'test_72',\n",
      " 'column_311': 'test_62',\n",
      " 'column_312': 'test_95',\n",
      " 'column_313': 'test_68',\n",
      " 'column_314': 'test_77',\n",
      " 'column_315': 'test_52',\n",
      " 'column_316': 'test_40',\n",
      " 'column_317': 'test_27',\n",
      " 'column_318': 'test_6',\n",
      " 'column_319': 'test_47',\n",
      " 'column_32': 'test_75',\n",
      " 'column_320': 'test_78',\n",
      " 'column_321': 'test_98',\n",
      " 'column_322': 'test_24',\n",
      " 'column_323': 'test_7',\n",
      " 'column_324': 'test_74',\n",
      " 'column_325': 'test_4',\n",
      " 'column_326': 'test_51',\n",
      " 'column_327': 'test_69',\n",
      " 'column_328': 'test_25',\n",
      " 'column_329': 'test_33',\n",
      " 'column_33': 'test_15',\n",
      " 'column_330': 'test_68',\n",
      " 'column_331': 'test_14',\n",
      " 'column_332': 'test_12',\n",
      " 'column_333': 'test_27',\n",
      " 'column_334': 'test_85',\n",
      " 'column_335': 'test_41',\n",
      " 'column_336': 'test_92',\n",
      " 'column_337': 'test_73',\n",
      " 'column_338': 'test_66',\n",
      " 'column_339': 'test_92',\n",
      " 'column_34': 'test_2',\n",
      " 'column_340': 'test_93',\n",
      " 'column_341': 'test_68',\n",
      " 'column_342': 'test_36',\n",
      " 'column_343': 'test_35',\n",
      " 'column_344': 'test_78',\n",
      " 'column_345': 'test_44',\n",
      " 'column_346': 'test_55',\n",
      " 'column_347': 'test_87',\n",
      " 'column_348': 'test_33',\n",
      " 'column_349': 'test_80',\n",
      " 'column_35': 'test_52',\n",
      " 'column_350': 'test_81',\n",
      " 'column_351': 'test_8',\n",
      " 'column_352': 'test_52',\n",
      " 'column_353': 'test_73',\n",
      " 'column_354': 'test_23',\n",
      " 'column_355': 'test_10',\n",
      " 'column_356': 'test_96',\n",
      " 'column_357': 'test_25',\n",
      " 'column_358': 'test_33',\n",
      " 'column_359': 'test_5',\n",
      " 'column_36': 'test_1',\n",
      " 'column_360': 'test_2',\n",
      " 'column_361': 'test_67',\n",
      " 'column_362': 'test_30',\n",
      " 'column_363': 'test_23',\n",
      " 'column_364': 'test_92',\n",
      " 'column_365': 'test_12',\n",
      " 'column_366': 'test_5',\n",
      " 'column_367': 'test_43',\n",
      " 'column_368': 'test_94',\n",
      " 'column_369': 'test_21',\n",
      " 'column_37': 'test_92',\n",
      " 'column_370': 'test_29',\n",
      " 'column_371': 'test_92',\n",
      " 'column_372': 'test_11',\n",
      " 'column_373': 'test_50',\n",
      " 'column_374': 'test_90',\n",
      " 'column_375': 'test_3',\n",
      " 'column_376': 'test_45',\n",
      " 'column_377': 'test_78',\n",
      " 'column_378': 'test_58',\n",
      " 'column_379': 'test_47',\n",
      " 'column_38': 'test_70',\n",
      " 'column_380': 'test_8',\n",
      " 'column_381': 'test_94',\n",
      " 'column_382': 'test_74',\n",
      " 'column_383': 'test_98',\n",
      " 'column_384': 'test_7',\n",
      " 'column_385': 'test_49',\n",
      " 'column_386': 'test_47',\n",
      " 'column_387': 'test_38',\n",
      " 'column_388': 'test_16',\n",
      " 'column_389': 'test_70',\n",
      " 'column_39': 'test_50',\n",
      " 'column_390': 'test_28',\n",
      " 'column_391': 'test_18',\n",
      " 'column_392': 'test_70',\n",
      " 'column_393': 'test_72',\n",
      " 'column_394': 'test_59',\n",
      " 'column_395': 'test_1',\n",
      " 'column_396': 'test_56',\n",
      " 'column_397': 'test_1',\n",
      " 'column_398': 'test_11',\n",
      " 'column_399': 'test_28',\n",
      " 'column_4': 'test_33',\n",
      " 'column_40': 'test_47',\n",
      " 'column_400': 'test_72',\n",
      " 'column_401': 'test_8',\n",
      " 'column_402': 'test_7',\n",
      " 'column_403': 'test_52',\n",
      " 'column_404': 'test_13',\n",
      " 'column_405': 'test_61',\n",
      " 'column_406': 'test_58',\n",
      " 'column_407': 'test_42',\n",
      " 'column_408': 'test_82',\n",
      " 'column_409': 'test_58',\n",
      " 'column_41': 'test_48',\n",
      " 'column_410': 'test_51',\n",
      " 'column_411': 'test_46',\n",
      " 'column_412': 'test_81',\n",
      " 'column_413': 'test_49',\n",
      " 'column_414': 'test_83',\n",
      " 'column_415': 'test_88',\n",
      " 'column_416': 'test_16',\n",
      " 'column_417': 'test_76',\n",
      " 'column_418': 'test_42',\n",
      " 'column_419': 'test_30',\n",
      " 'column_42': 'test_56',\n",
      " 'column_420': 'test_11',\n",
      " 'column_421': 'test_71',\n",
      " 'column_422': 'test_47',\n",
      " 'column_423': 'test_28',\n",
      " 'column_424': 'test_95',\n",
      " 'column_425': 'test_54',\n",
      " 'column_426': 'test_47',\n",
      " 'column_427': 'test_55',\n",
      " 'column_428': 'test_11',\n",
      " 'column_429': 'test_91',\n",
      " 'column_43': 'test_18',\n",
      " 'column_430': 'test_67',\n",
      " 'column_431': 'test_86',\n",
      " 'column_432': 'test_35',\n",
      " 'column_433': 'test_90',\n",
      " 'column_434': 'test_88',\n",
      " 'column_435': 'test_61',\n",
      " 'column_436': 'test_18',\n",
      " 'column_437': 'test_100',\n",
      " 'column_438': 'test_22',\n",
      " 'column_439': 'test_25',\n",
      " 'column_44': 'test_36',\n",
      " 'column_440': 'test_6',\n",
      " 'column_441': 'test_35',\n",
      " 'column_442': 'test_41',\n",
      " 'column_443': 'test_91',\n",
      " 'column_444': 'test_47',\n",
      " 'column_445': 'test_25',\n",
      " 'column_446': 'test_87',\n",
      " 'column_447': 'test_22',\n",
      " 'column_448': 'test_73',\n",
      " 'column_449': 'test_32',\n",
      " 'column_45': 'test_48',\n",
      " 'column_450': 'test_3',\n",
      " 'column_451': 'test_11',\n",
      " 'column_452': 'test_40',\n",
      " 'column_453': 'test_83',\n",
      " 'column_454': 'test_28',\n",
      " 'column_455': 'test_10',\n",
      " 'column_456': 'test_12',\n",
      " 'column_457': 'test_83',\n",
      " 'column_458': 'test_51',\n",
      " 'column_459': 'test_90',\n",
      " 'column_46': 'test_43',\n",
      " 'column_460': 'test_26',\n",
      " 'column_461': 'test_91',\n",
      " 'column_462': 'test_34',\n",
      " 'column_463': 'test_66',\n",
      " 'column_464': 'test_13',\n",
      " 'column_465': 'test_13',\n",
      " 'column_466': 'test_65',\n",
      " 'column_467': 'test_90',\n",
      " 'column_468': 'test_21',\n",
      " 'column_469': 'test_11',\n",
      " 'column_47': 'test_56',\n",
      " 'column_470': 'test_15',\n",
      " 'column_471': 'test_24',\n",
      " 'column_472': 'test_1',\n",
      " 'column_473': 'test_85',\n",
      " 'column_474': 'test_79',\n",
      " 'column_475': 'test_91',\n",
      " 'column_476': 'test_51',\n",
      " 'column_477': 'test_2',\n",
      " 'column_478': 'test_10',\n",
      " 'column_479': 'test_77',\n",
      " 'column_48': 'test_15',\n",
      " 'column_480': 'test_26',\n",
      " 'column_481': 'test_16',\n",
      " 'column_482': 'test_100',\n",
      " 'column_483': 'test_9',\n",
      " 'column_484': 'test_19',\n",
      " 'column_485': 'test_31',\n",
      " 'column_486': 'test_39',\n",
      " 'column_487': 'test_65',\n",
      " 'column_488': 'test_37',\n",
      " 'column_489': 'test_32',\n",
      " 'column_49': 'test_28',\n",
      " 'column_490': 'test_35',\n",
      " 'column_491': 'test_61',\n",
      " 'column_492': 'test_56',\n",
      " 'column_493': 'test_29',\n",
      " 'column_494': 'test_93',\n",
      " 'column_495': 'test_49',\n",
      " 'column_496': 'test_24',\n",
      " 'column_497': 'test_76',\n",
      " 'column_498': 'test_63',\n",
      " 'column_499': 'test_19',\n",
      " 'column_5': 'test_94',\n",
      " 'column_50': 'test_76',\n",
      " 'column_51': 'test_88',\n",
      " 'column_52': 'test_24',\n",
      " 'column_53': 'test_84',\n",
      " 'column_54': 'test_27',\n",
      " 'column_55': 'test_89',\n",
      " 'column_56': 'test_78',\n",
      " 'column_57': 'test_27',\n",
      " 'column_58': 'test_33',\n",
      " 'column_59': 'test_42',\n",
      " 'column_6': 'test_88',\n",
      " 'column_60': 'test_94',\n",
      " 'column_61': 'test_51',\n",
      " 'column_62': 'test_62',\n",
      " 'column_63': 'test_93',\n",
      " 'column_64': 'test_53',\n",
      " 'column_65': 'test_6',\n",
      " 'column_66': 'test_73',\n",
      " 'column_67': 'test_94',\n",
      " 'column_68': 'test_20',\n",
      " 'column_69': 'test_4',\n",
      " 'column_7': 'test_16',\n",
      " 'column_70': 'test_7',\n",
      " 'column_71': 'test_97',\n",
      " 'column_72': 'test_38',\n",
      " 'column_73': 'test_61',\n",
      " 'column_74': 'test_58',\n",
      " 'column_75': 'test_99',\n",
      " 'column_76': 'test_71',\n",
      " 'column_77': 'test_97',\n",
      " 'column_78': 'test_19',\n",
      " 'column_79': 'test_45',\n",
      " 'column_8': 'test_20',\n",
      " 'column_80': 'test_59',\n",
      " 'column_81': 'test_2',\n",
      " 'column_82': 'test_77',\n",
      " 'column_83': 'test_51',\n",
      " 'column_84': 'test_38',\n",
      " 'column_85': 'test_92',\n",
      " 'column_86': 'test_84',\n",
      " 'column_87': 'test_69',\n",
      " 'column_88': 'test_83',\n",
      " 'column_89': 'test_95',\n",
      " 'column_9': 'test_69',\n",
      " 'column_90': 'test_36',\n",
      " 'column_91': 'test_60',\n",
      " 'column_92': 'test_59',\n",
      " 'column_93': 'test_28',\n",
      " 'column_94': 'test_37',\n",
      " 'column_95': 'test_54',\n",
      " 'column_96': 'test_86',\n",
      " 'column_97': 'test_100',\n",
      " 'column_98': 'test_29',\n",
      " 'column_99': 'test_81',\n",
      " 'float_field': 12,\n",
      " 'int_field': 9,\n",
      " 'list_field': [1, 2, 3],\n",
      " 'name': 'item_66',\n",
      " 'nested_value': {'name': 'item_6', 'value': 0}}\n"
     ]
    }
   ],
   "source": [
    "# Define a simple template data entry\n",
    "template_dict = {\n",
    "    \"float_field\": 10,\n",
    "    \"int_field\": 10,\n",
    "    \"name\": \"item\",\n",
    "    \"nested_value\": {\"value\": 10, \"name\": \"item\"},\n",
    "    \"list_field\": [1, 2, 3],\n",
    "}\n",
    "for x in range(500):\n",
    "    template_dict[f\"column_{x}\"] = \"test\"\n",
    "\n",
    "template = [template_dict]\n",
    "\n",
    "# Generate multiple data entries\n",
    "num_entries = 100000  # Feel free to adjust this\n",
    "data = generate_similar_data(template, num_entries)\n",
    "\n",
    "print(\"Generated Data:\")\n",
    "pprint.pprint(data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we import the data into our database\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 507\n",
      "• Number of rows: 100000\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [100000]\n",
      "• Number of row groups per file: [4]\n",
      "• Serialized metadata size per file: [225147] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - column_296\n",
      "    - column_302\n",
      "    - column_370\n",
      "    - column_0\n",
      "    - column_80\n",
      "    - column_367\n",
      "    - column_344\n",
      "    - column_216\n",
      "    - column_151\n",
      "    - column_61\n",
      "    - column_439\n",
      "    - column_87\n",
      "    - column_379\n",
      "    - column_351\n",
      "    - column_329\n",
      "    - column_423\n",
      "    - column_97\n",
      "    - column_42\n",
      "    - column_401\n",
      "    - column_355\n",
      "    - column_171\n",
      "    - column_261\n",
      "    - column_125\n",
      "    - column_270\n",
      "    - column_398\n",
      "    - column_17\n",
      "    - column_18\n",
      "    - column_479\n",
      "    - column_186\n",
      "    - column_471\n",
      "    - column_274\n",
      "    - column_236\n",
      "    - column_421\n",
      "    - column_394\n",
      "    - column_369\n",
      "    - column_397\n",
      "    - column_417\n",
      "    - column_432\n",
      "    - column_258\n",
      "    - column_120\n",
      "    - column_178\n",
      "    - column_415\n",
      "    - column_126\n",
      "    - column_477\n",
      "    - column_124\n",
      "    - column_105\n",
      "    - column_360\n",
      "    - column_374\n",
      "    - column_99\n",
      "    - column_314\n",
      "    - column_380\n",
      "    - column_154\n",
      "    - column_320\n",
      "    - name\n",
      "    - column_459\n",
      "    - column_78\n",
      "    - column_15\n",
      "    - column_291\n",
      "    - column_189\n",
      "    - column_229\n",
      "    - column_412\n",
      "    - column_276\n",
      "    - column_96\n",
      "    - column_123\n",
      "    - column_495\n",
      "    - column_358\n",
      "    - column_354\n",
      "    - column_169\n",
      "    - column_156\n",
      "    - column_452\n",
      "    - column_483\n",
      "    - column_170\n",
      "    - column_163\n",
      "    - column_119\n",
      "    - column_180\n",
      "    - column_463\n",
      "    - column_176\n",
      "    - column_349\n",
      "    - column_106\n",
      "    - column_316\n",
      "    - column_442\n",
      "    - column_338\n",
      "    - column_206\n",
      "    - float_field\n",
      "    - column_68\n",
      "    - column_444\n",
      "    - column_222\n",
      "    - column_16\n",
      "    - column_324\n",
      "    - column_58\n",
      "    - list_field\n",
      "    - column_364\n",
      "    - column_226\n",
      "    - column_366\n",
      "    - column_82\n",
      "    - column_323\n",
      "    - column_95\n",
      "    - column_248\n",
      "    - column_437\n",
      "    - column_204\n",
      "    - column_275\n",
      "    - column_160\n",
      "    - column_307\n",
      "    - column_37\n",
      "    - column_116\n",
      "    - column_333\n",
      "    - column_143\n",
      "    - column_436\n",
      "    - column_46\n",
      "    - column_148\n",
      "    - column_376\n",
      "    - column_164\n",
      "    - column_277\n",
      "    - column_278\n",
      "    - column_357\n",
      "    - column_365\n",
      "    - column_212\n",
      "    - column_361\n",
      "    - column_499\n",
      "    - column_85\n",
      "    - column_217\n",
      "    - column_402\n",
      "    - column_157\n",
      "    - column_272\n",
      "    - column_285\n",
      "    - column_426\n",
      "    - column_57\n",
      "    - column_400\n",
      "    - column_240\n",
      "    - nested_value.value\n",
      "    - column_127\n",
      "    - column_89\n",
      "    - column_468\n",
      "    - id\n",
      "    - column_118\n",
      "    - column_469\n",
      "    - column_418\n",
      "    - column_490\n",
      "    - column_150\n",
      "    - column_430\n",
      "    - column_152\n",
      "    - column_455\n",
      "    - column_475\n",
      "    - column_172\n",
      "    - column_146\n",
      "    - column_8\n",
      "    - column_359\n",
      "    - column_420\n",
      "    - column_408\n",
      "    - column_84\n",
      "    - column_210\n",
      "    - column_309\n",
      "    - column_482\n",
      "    - column_102\n",
      "    - column_195\n",
      "    - column_223\n",
      "    - column_194\n",
      "    - column_200\n",
      "    - column_29\n",
      "    - column_158\n",
      "    - column_336\n",
      "    - column_207\n",
      "    - column_414\n",
      "    - column_22\n",
      "    - column_228\n",
      "    - column_407\n",
      "    - column_454\n",
      "    - column_234\n",
      "    - column_322\n",
      "    - column_64\n",
      "    - column_45\n",
      "    - column_310\n",
      "    - column_147\n",
      "    - column_348\n",
      "    - column_38\n",
      "    - column_115\n",
      "    - column_269\n",
      "    - column_167\n",
      "    - column_179\n",
      "    - column_447\n",
      "    - column_330\n",
      "    - column_419\n",
      "    - column_32\n",
      "    - column_94\n",
      "    - column_187\n",
      "    - column_203\n",
      "    - column_14\n",
      "    - column_337\n",
      "    - column_498\n",
      "    - column_98\n",
      "    - column_67\n",
      "    - column_263\n",
      "    - column_235\n",
      "    - column_249\n",
      "    - column_292\n",
      "    - column_63\n",
      "    - column_27\n",
      "    - column_53\n",
      "    - column_26\n",
      "    - column_450\n",
      "    - column_243\n",
      "    - column_44\n",
      "    - column_306\n",
      "    - column_331\n",
      "    - column_315\n",
      "    - column_108\n",
      "    - column_428\n",
      "    - column_383\n",
      "    - column_453\n",
      "    - column_232\n",
      "    - column_460\n",
      "    - column_413\n",
      "    - column_441\n",
      "    - column_3\n",
      "    - column_390\n",
      "    - column_60\n",
      "    - column_117\n",
      "    - column_303\n",
      "    - column_215\n",
      "    - column_91\n",
      "    - column_434\n",
      "    - column_52\n",
      "    - column_145\n",
      "    - column_294\n",
      "    - column_256\n",
      "    - column_470\n",
      "    - column_75\n",
      "    - column_202\n",
      "    - column_451\n",
      "    - column_347\n",
      "    - column_262\n",
      "    - column_144\n",
      "    - column_31\n",
      "    - column_472\n",
      "    - column_213\n",
      "    - column_218\n",
      "    - column_363\n",
      "    - column_456\n",
      "    - column_155\n",
      "    - column_73\n",
      "    - column_484\n",
      "    - column_23\n",
      "    - column_404\n",
      "    - column_431\n",
      "    - column_133\n",
      "    - column_221\n",
      "    - column_440\n",
      "    - column_279\n",
      "    - column_438\n",
      "    - column_429\n",
      "    - column_327\n",
      "    - column_231\n",
      "    - column_350\n",
      "    - column_392\n",
      "    - column_49\n",
      "    - column_25\n",
      "    - column_230\n",
      "    - column_142\n",
      "    - column_284\n",
      "    - column_191\n",
      "    - column_445\n",
      "    - column_140\n",
      "    - column_405\n",
      "    - column_55\n",
      "    - column_93\n",
      "    - column_461\n",
      "    - column_254\n",
      "    - column_340\n",
      "    - column_136\n",
      "    - column_39\n",
      "    - column_138\n",
      "    - column_435\n",
      "    - column_252\n",
      "    - column_128\n",
      "    - column_237\n",
      "    - column_192\n",
      "    - column_188\n",
      "    - column_406\n",
      "    - column_488\n",
      "    - column_224\n",
      "    - column_209\n",
      "    - column_250\n",
      "    - column_121\n",
      "    - column_389\n",
      "    - column_403\n",
      "    - column_50\n",
      "    - column_70\n",
      "    - column_290\n",
      "    - column_198\n",
      "    - column_388\n",
      "    - column_199\n",
      "    - column_36\n",
      "    - column_334\n",
      "    - column_35\n",
      "    - column_377\n",
      "    - column_40\n",
      "    - column_395\n",
      "    - column_185\n",
      "    - column_386\n",
      "    - column_100\n",
      "    - column_111\n",
      "    - column_10\n",
      "    - column_114\n",
      "    - column_239\n",
      "    - column_113\n",
      "    - column_433\n",
      "    - column_473\n",
      "    - column_326\n",
      "    - column_465\n",
      "    - column_7\n",
      "    - column_72\n",
      "    - column_51\n",
      "    - column_2\n",
      "    - column_165\n",
      "    - column_448\n",
      "    - column_183\n",
      "    - column_257\n",
      "    - column_308\n",
      "    - column_384\n",
      "    - column_131\n",
      "    - column_266\n",
      "    - column_5\n",
      "    - column_282\n",
      "    - column_493\n",
      "    - column_104\n",
      "    - column_181\n",
      "    - column_30\n",
      "    - column_491\n",
      "    - column_287\n",
      "    - column_153\n",
      "    - column_242\n",
      "    - column_298\n",
      "    - column_149\n",
      "    - column_265\n",
      "    - column_139\n",
      "    - column_339\n",
      "    - column_385\n",
      "    - column_168\n",
      "    - column_372\n",
      "    - column_411\n",
      "    - column_166\n",
      "    - column_141\n",
      "    - column_11\n",
      "    - column_83\n",
      "    - column_4\n",
      "    - column_47\n",
      "    - column_424\n",
      "    - column_362\n",
      "    - column_214\n",
      "    - column_244\n",
      "    - column_368\n",
      "    - column_474\n",
      "    - column_54\n",
      "    - column_409\n",
      "    - column_174\n",
      "    - column_271\n",
      "    - column_9\n",
      "    - column_129\n",
      "    - column_335\n",
      "    - column_356\n",
      "    - column_299\n",
      "    - column_325\n",
      "    - column_79\n",
      "    - column_193\n",
      "    - column_253\n",
      "    - column_137\n",
      "    - column_328\n",
      "    - column_173\n",
      "    - column_76\n",
      "    - column_56\n",
      "    - column_134\n",
      "    - column_159\n",
      "    - column_24\n",
      "    - column_297\n",
      "    - column_283\n",
      "    - column_449\n",
      "    - column_288\n",
      "    - column_494\n",
      "    - column_422\n",
      "    - column_496\n",
      "    - column_6\n",
      "    - column_74\n",
      "    - column_268\n",
      "    - column_427\n",
      "    - column_382\n",
      "    - column_462\n",
      "    - column_311\n",
      "    - column_273\n",
      "    - column_289\n",
      "    - column_373\n",
      "    - column_247\n",
      "    - column_238\n",
      "    - column_211\n",
      "    - nested_value.name\n",
      "    - column_280\n",
      "    - column_196\n",
      "    - column_487\n",
      "    - column_458\n",
      "    - column_225\n",
      "    - column_295\n",
      "    - column_425\n",
      "    - column_62\n",
      "    - column_457\n",
      "    - column_313\n",
      "    - column_378\n",
      "    - column_109\n",
      "    - column_175\n",
      "    - column_13\n",
      "    - column_182\n",
      "    - column_342\n",
      "    - column_343\n",
      "    - column_12\n",
      "    - column_321\n",
      "    - column_201\n",
      "    - column_485\n",
      "    - column_197\n",
      "    - column_1\n",
      "    - column_177\n",
      "    - column_161\n",
      "    - column_259\n",
      "    - column_305\n",
      "    - column_341\n",
      "    - column_416\n",
      "    - column_130\n",
      "    - column_345\n",
      "    - column_381\n",
      "    - column_208\n",
      "    - column_135\n",
      "    - column_300\n",
      "    - column_66\n",
      "    - column_241\n",
      "    - column_264\n",
      "    - column_467\n",
      "    - column_317\n",
      "    - column_286\n",
      "    - column_86\n",
      "    - column_41\n",
      "    - column_122\n",
      "    - column_255\n",
      "    - column_92\n",
      "    - column_486\n",
      "    - column_69\n",
      "    - column_393\n",
      "    - column_107\n",
      "    - column_466\n",
      "    - column_28\n",
      "    - column_301\n",
      "    - column_220\n",
      "    - column_245\n",
      "    - column_219\n",
      "    - column_375\n",
      "    - column_480\n",
      "    - column_190\n",
      "    - column_112\n",
      "    - column_371\n",
      "    - column_481\n",
      "    - column_132\n",
      "    - column_101\n",
      "    - column_246\n",
      "    - column_332\n",
      "    - column_267\n",
      "    - column_65\n",
      "    - column_353\n",
      "    - column_464\n",
      "    - column_293\n",
      "    - column_48\n",
      "    - column_103\n",
      "    - column_205\n",
      "    - column_19\n",
      "    - column_446\n",
      "    - column_492\n",
      "    - column_43\n",
      "    - column_318\n",
      "    - column_443\n",
      "    - column_59\n",
      "    - int_field\n",
      "    - column_20\n",
      "    - column_281\n",
      "    - column_489\n",
      "    - column_312\n",
      "    - column_184\n",
      "    - column_34\n",
      "    - column_81\n",
      "    - column_77\n",
      "    - column_33\n",
      "    - column_71\n",
      "    - column_497\n",
      "    - column_88\n",
      "    - column_387\n",
      "    - column_319\n",
      "    - column_304\n",
      "    - column_399\n",
      "    - column_227\n",
      "    - column_410\n",
      "    - column_233\n",
      "    - column_260\n",
      "    - column_391\n",
      "    - column_352\n",
      "    - column_110\n",
      "    - column_476\n",
      "    - column_90\n",
      "    - column_21\n",
      "    - column_251\n",
      "    - column_346\n",
      "    - column_396\n",
      "    - column_478\n",
      "    - column_162\n",
      "\n"
     ]
    }
   ],
   "source": [
    "db_path = \"ParquetDB\"\n",
    "if os.path.exists(db_path):\n",
    "    shutil.rmtree(db_path)\n",
    "db = ParquetDB(db_path=db_path)\n",
    "\n",
    "db.create(data)\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      column_296 column_302 column_370  column_0 column_80 column_367  \\\n",
      "0        test_33    test_30    test_29   test_74   test_59    test_43   \n",
      "1        test_62    test_97    test_10   test_29   test_59    test_47   \n",
      "2         test_6    test_65    test_28   test_29    test_6    test_81   \n",
      "3        test_35    test_61    test_57   test_61   test_71    test_92   \n",
      "4        test_56    test_44    test_70   test_18   test_75    test_77   \n",
      "...          ...        ...        ...       ...       ...        ...   \n",
      "99995    test_43    test_48    test_46   test_77   test_26    test_20   \n",
      "99996     test_6    test_45    test_36  test_100   test_47    test_58   \n",
      "99997     test_2    test_33    test_61   test_99   test_64    test_44   \n",
      "99998    test_96    test_15    test_26   test_58   test_35    test_77   \n",
      "99999    test_63    test_87    test_39   test_38   test_59    test_83   \n",
      "\n",
      "      column_344 column_216 column_151 column_61  ... column_352 column_110  \\\n",
      "0        test_78    test_62    test_58   test_51  ...    test_52    test_94   \n",
      "1        test_21    test_39    test_50   test_78  ...    test_32    test_57   \n",
      "2        test_32    test_21    test_34   test_25  ...    test_70    test_46   \n",
      "3        test_66     test_7    test_18    test_3  ...    test_42     test_4   \n",
      "4        test_75    test_51    test_76   test_76  ...    test_16    test_36   \n",
      "...          ...        ...        ...       ...  ...        ...        ...   \n",
      "99995    test_71    test_11    test_43   test_73  ...    test_89    test_27   \n",
      "99996    test_38    test_33    test_82   test_99  ...    test_90     test_3   \n",
      "99997    test_26    test_29    test_19   test_74  ...    test_70    test_50   \n",
      "99998    test_43    test_16    test_54   test_40  ...     test_5    test_15   \n",
      "99999    test_82    test_53    test_46   test_71  ...    test_97    test_74   \n",
      "\n",
      "      column_476 column_90 column_21 column_251 column_346 column_396  \\\n",
      "0        test_51   test_36   test_66    test_77    test_55    test_56   \n",
      "1        test_23   test_36   test_48    test_77    test_86    test_81   \n",
      "2        test_66   test_63   test_71    test_82    test_89    test_16   \n",
      "3        test_26   test_71   test_30    test_56    test_17    test_32   \n",
      "4         test_6   test_66   test_14    test_43    test_23    test_16   \n",
      "...          ...       ...       ...        ...        ...        ...   \n",
      "99995    test_55    test_9   test_38    test_15    test_98    test_99   \n",
      "99996    test_67    test_9   test_61    test_13    test_39    test_25   \n",
      "99997     test_9    test_5   test_48    test_69    test_44    test_11   \n",
      "99998    test_10   test_92    test_5    test_31    test_44    test_84   \n",
      "99999    test_82   test_13   test_54    test_39    test_84    test_51   \n",
      "\n",
      "      column_478 column_162  \n",
      "0        test_10    test_54  \n",
      "1         test_4    test_44  \n",
      "2        test_90    test_75  \n",
      "3        test_17    test_29  \n",
      "4        test_68    test_50  \n",
      "...          ...        ...  \n",
      "99995    test_28    test_58  \n",
      "99996    test_89     test_4  \n",
      "99997    test_95    test_74  \n",
      "99998    test_61    test_23  \n",
      "99999     test_7    test_53  \n",
      "\n",
      "[100000 rows x 507 columns]\n"
     ]
    }
   ],
   "source": [
    "data = None\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalize Data Using ParquetDB\n",
    "\n",
    "Next, we'll introduce the `NormalizeConfig` class, which allows you to fine-tune how normalization is performed over the various operations in ParquetDB.\n",
    "\n",
    "### The `NormalizeConfig` Class\n",
    "\n",
    "```python \n",
    "@dataclass\n",
    "class NormalizeConfig:\n",
    "    load_format: str = \"table\"\n",
    "    batch_size: int = 131_072\n",
    "    batch_readahead: int = 16\n",
    "    fragment_readahead: int = 4\n",
    "    fragment_scan_options: Optional[pa.dataset.FragmentScanOptions] = None\n",
    "    use_threads: bool = True\n",
    "    memory_pool: Optional[pa.MemoryPool] = None\n",
    "    filesystem: Optional[fs.FileSystem] = None\n",
    "    file_options: Optional[ds.FileWriteOptions] = None\n",
    "    use_threads: bool = config.parquetdb_config.normalize_kwargs.use_threads\n",
    "    max_partitions: int = config.parquetdb_config.normalize_kwargs.max_partitions\n",
    "    max_open_files: int = config.parquetdb_config.normalize_kwargs.max_open_files\n",
    "    max_rows_per_file: int = config.parquetdb_config.normalize_kwargs.max_rows_per_file\n",
    "    min_rows_per_group: int = (\n",
    "        config.parquetdb_config.normalize_kwargs.min_rows_per_group\n",
    "    )\n",
    "    max_rows_per_group: int = (\n",
    "        config.parquetdb_config.normalize_kwargs.max_rows_per_group\n",
    "    )\n",
    "    file_visitor: Optional[Callable] = None\n",
    "    existing_data_behavior: str = (\n",
    "        config.parquetdb_config.normalize_kwargs.existing_data_behavior\n",
    "    )\n",
    "    create_dir: bool = True\n",
    "\n",
    "```\n",
    "\n",
    "The `NormalizeConfig` data class allows you to fine-tune how normalization is performed. The most important parameters are the following:\n",
    "\n",
    "- **`load_format : str`**  The format of the output dataset. Supported formats are `'table'` and `'batches'` (default: `'table'`).\n",
    "\n",
    "- **`batch_size : int, optional`**  The number of rows to process in each batch (default: `None`).\n",
    "\n",
    "- **`batch_readahead : int, optional`**  The number of batches to read ahead in a file (default: `16`).\n",
    "\n",
    "- **`fragment_readahead : int, optional`**  The number of files to read ahead, improving IO utilization at the cost of RAM usage (default: `4`).\n",
    "\n",
    "- **`max_open_files : int`**  Maximum open files for dataset writing (default: `1024`).\n",
    "\n",
    "- **`max_rows_per_file : int`**  Maximum rows per file (default: `10,000`).\n",
    "\n",
    "- **`min_rows_per_group : int`**  Minimum rows per row group within each file (default: `0`).\n",
    "\n",
    "- **`max_rows_per_group : int`**  Maximum rows per row group within each file (default: `10,000`).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In parquet files, it stores the data in row groups, this allows for batching and parallelization. \n",
    "\n",
    "Below is a diagram of what your data will look like in memory. \n",
    "\n",
    "- Csv files use a row based system, which is inefficient as it does not store similar data contiguously in memory.\n",
    "- Columnar storage is more efficient as it stores similar data contiguously in memory, however batching the data is not so great as data is not stored in chunks.\n",
    "- Parquet files use a Row Group based system, which is more efficient as it stores similar data contiguously, but it also stores in chunks which is great for parallelization.\n",
    "\n",
    "![Row Group Storage](../media/images/row_group_storage.png)\n",
    "\n",
    "\n",
    "Optimizing parameters like the number of rows per row group, how many row groups per file, and how many files to read ahead can help significantly improve speed and memory performance.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the details of the row groups of our current dataset. We can do this by using the `summary`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 507\n",
      "• Number of rows: 100000\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [100000]\n",
      "• Number of row groups per file: [4]\n",
      "• Number of rows per row group per file: \n",
      "    - ParquetDB_0.parquet:\n",
      "        - Row group 0: 32768 rows\n",
      "        - Row group 1: 32768 rows\n",
      "        - Row group 2: 32768 rows\n",
      "        - Row group 3: 1696 rows\n",
      "• Serialized metadata size per file: [225147] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(db.summary(show_row_group_metadata=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that in our first file, we have 4 row groups where there is a maximum rows per group of 32,768. Typically, this is fine but if your system can handle it, it is best the chunk the data into larger groups.\n",
    "\n",
    "A good rule of thumb for these settings should be about 2 GB per file and about 200MB per row group size. This will require some trial and error to find the best settings for your system.\n",
    "\n",
    "Let's normalize the data with `NormalizeConfig` and change it so it is 50000 rows per row group.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 507\n",
      "• Number of rows: 100000\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [100000]\n",
      "• Number of row groups per file: [2]\n",
      "• Number of rows per row group per file: \n",
      "    - ParquetDB_0.parquet:\n",
      "        - Row group 0: 50000 rows\n",
      "        - Row group 1: 50000 rows\n",
      "• Serialized metadata size per file: [136357] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from parquetdb import NormalizeConfig\n",
    "\n",
    "normalize_config = NormalizeConfig(min_rows_per_group=50000, max_rows_per_group=50000)\n",
    "\n",
    "db.normalize(normalize_config=normalize_config)\n",
    "\n",
    "print(db.summary(show_row_group_metadata=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have two row groups, each with a maximum of 50,000 rows. \n",
    "\n",
    "In some cases, however, this might be too large—especially if you're working with a wide dataset (e.g., ~4,000 columns). To handle such cases, we can reduce the number of rows per group to 10,000 for better performance.\n",
    "\n",
    "For particularly large datasets, it's also important to fine-tune additional parameters, such as `batch_readahead`, `fragment_readahead`, `load_format=\"batches\"`, and `batch_size`. \n",
    "\n",
    "By default, ParquetDB uses `load_format=\"table\"` and `batch_size=None`, meaning it will attempt to write and read all the data at once. While this approach works well for smaller datasets, it can cause performance bottlenecks when handling larger datasets.\n",
    "\n",
    "To address this, we can set `load_format=\"batches\"` and define `batch_size=5000`. This configuration ensures that data is processed in chunks of 5,000 rows at a time, improving memory management. Additionally, setting `batch_readahead=2` allows ParquetDB to load two batches into memory ahead of processing, further enhancing performance by reducing waiting times.\n",
    "\n",
    "When reading data, ParquetDB processes files sequentially. To optimize this process, we can control how many files are opened and read ahead by setting `fragment_readahead=2`. This ensures that the system reads two files ahead, balancing I/O performance and memory usage.\n",
    "\n",
    "> Note: The batch size can only go as high as the number of rows in a row group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 507\n",
      "• Number of rows: 100000\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [100000]\n",
      "• Number of row groups per file: [10]\n",
      "• Number of rows per row group per file: \n",
      "    - ParquetDB_0.parquet:\n",
      "        - Row group 0: 10000 rows\n",
      "        - Row group 1: 10000 rows\n",
      "        - Row group 2: 10000 rows\n",
      "        - Row group 3: 10000 rows\n",
      "        - Row group 4: 10000 rows\n",
      "        - Row group 5: 10000 rows\n",
      "        - Row group 6: 10000 rows\n",
      "        - Row group 7: 10000 rows\n",
      "        - Row group 8: 10000 rows\n",
      "        - Row group 9: 10000 rows\n",
      "• Serialized metadata size per file: [497366] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "\n"
     ]
    }
   ],
   "source": [
    "normalize_config = NormalizeConfig(\n",
    "    load_format=\"batches\",\n",
    "    batch_size=100,\n",
    "    batch_readahead=16,\n",
    "    fragment_readahead=4,\n",
    "    max_rows_per_group=10000,\n",
    "    min_rows_per_group=10000,\n",
    ")\n",
    "\n",
    "db.normalize(normalize_config=normalize_config)\n",
    "\n",
    "print(db.summary(show_row_group_metadata=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finishing\n",
    "\n",
    "Now that we normalized the data, we can see that the data is more evenly distributed across the row groups. Many methods, such as (`read`, `update`, `delete`, `transform`, `update_schema`), in ParquetDB take as an argument a `normalize_config` which allows you to fine-tune the normalization process during these operations.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "parquetdb_dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
